Succinct Suffix Arrays Based on Run-Length Encoding
Authors
Abstract
A succinct full-text self-index is a data structure built on a text T = t1t2...tn that takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2...pm in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take O(nH0) or O(nHk) bits, where Hk is the kth-order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n). We present a new self-index, called the run-length FM-index (RLFM index), that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The index requires nHk log2 σ + O(n) bits of space for small k. We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare our implementations against the best existing implementations of other indexes and show that ours are the fastest among indexes taking less space than the text.
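The counting query described in the abstract can be illustrated with backward search over the Burrows-Wheeler transform, the technique used by indexes in the FM-index family. The following is a minimal sketch, not the authors' implementation: the function names `bwt`, `run_heads`, and `count` are hypothetical, the suffix sorting is naive, and rank is computed by scanning, so this version does not achieve the O(m) bound. An actual RLFM index would, as the name suggests, answer rank queries with succinct structures built over the runs of the transformed text.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (illustration only)."""
    s = text + "\0"  # assumed terminator: unique and smaller than any text character
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The character preceding each sorted suffix (wraps around for suffix 0).
    return "".join(s[i - 1] for i in sa)


def run_heads(L):
    """Run-length encode the BWT as a list of (character, run length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(L)]


def count(L, pattern):
    """Count occurrences of pattern in the original text by backward search on L."""
    # C[c] = number of characters in L that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    lo, hi = 0, len(L)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Naive rank by scanning; a real index answers this in constant time.
        lo = C[c] + L[:lo].count(c)
        hi = C[c] + L[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


L = bwt("abracadabra")
print(count(L, "abra"), len(run_heads(L)))
```

The loop performs one step per pattern character, which is where the O(m) counting time comes from once rank is supported in constant time. Repetitive texts tend to produce few runs in the transform, which is the property a run-length based index exploits to save space.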
Similar resources
Counting Suffix Arrays and Strings
Suffix arrays are used in various application and research areas like data compression or computational biology. In this work, our goal is to characterize the combinatorial properties of suffix arrays and their enumeration. For fixed alphabet size and string length we count the number of strings sharing the same suffix array and the number of such suffix arrays. Our methods have applications to...
Time and Space Efficient Lempel-Ziv Factorization based on Run Length Encoding
We propose a new approach for calculating the Lempel-Ziv factorization of a string, based on run length encoding (RLE). We present a conceptually simple off-line algorithm based on a variant of suffix arrays, as well as an on-line algorithm based on a variant of directed acyclic word graphs (DAWGs). Both algorithms run in O(N + n log n) time and O(n) extra space, where N is the size of the stri...
Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
MOTIVATION: Over the last few years, methods based on suffix arrays using the Burrows-Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on stati...
Succinct representations of lcp information and improvements in the compressed suffix arrays
We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement in the compressed suffix array which supports linear time counting queries for any pattern. The former occupies only 2n + o(n) bits for a text of length n for computing lcp between ...
Improving Text Indexes Using Compressed Permutations
Any sorting algorithm in the comparison model defines an encoding scheme for permutations. As adaptive sorting algorithms perform o(n lg n) comparisons on restricted classes of permutations, each defines one or more compression schemes for permutations. In the case of the compression schemes inspired by Adaptive Merge Sort, a small amount of additional data allows to support in good time the ac...
Journal title:
- Nord. J. Comput.
Volume 12, Issue
Pages -
Publication date: 2005